A Theoretical Framework for Self-Supervised MR Image Reconstruction Using Sub-Sampling via Variable Density Noisier2Noise

In recent years, there has been attention on leveraging the statistical modeling capabilities of neural networks for reconstructing sub-sampled Magnetic Resonance Imaging (MRI) data. Most proposed methods assume the existence of a representative fully-sampled dataset and use fully-supervised training. However, for many applications, fully sampled training data is not available, and may be highly impractical to acquire. The development and understanding of self-supervised methods, which use only sub-sampled data for training, are therefore highly desirable. This work extends the Noisier2Noise framework, which was originally constructed for self-supervised denoising tasks, to variable density sub-sampled MRI data. We use the Noisier2Noise framework to analytically explain the performance of Self-Supervised Learning via Data Undersampling (SSDU), a recently proposed method that performs well in practice but until now lacked theoretical justification. Further, we propose two modifications of SSDU that arise as a consequence of the theoretical developments. Firstly, we propose partitioning the sampling set so that the subsets have the same type of distribution as the original sampling mask. Secondly, we propose a loss weighting that compensates for the sampling and partitioning densities. On the fastMRI dataset we show that these changes significantly improve SSDU’s image restoration quality and robustness to the partitioning parameters.

There has been significant research attention in recent years on methods that reconstruct sub-sampled MRI data with neural networks [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. The majority of these works use fully-supervised training. To train a network in a fullysupervised manner, there must be a dataset comprised of fully sampled k-space data y 0,t ∈ C N , where N is the dimension of k-space multiplied by the number of coils, and paired subsampled data y t = M Ω t y 0,t . Here, t indexes the training set and M Ω t ∈ R N ×N is a sub-sampling mask with sampling set Ω t , so that the jth diagonal of M Ω t is 1 if j ∈ Ω t and zero otherwise.
Then a network f θ with parameters θ is trained by seeking a minimum of a non-convex loss function: which could be, for example, an p norm in the image domain after coil combination [25]. The network fθ estimates the ground truth in the image domain or k-space depending on the choice of loss function. For a k-space to k-space network, y 0,s can be estimated withŷ s = fθ(y s ), where s indexes the test set. Given sufficient representative training data, fully-supervised networks can yield substantial reconstruction quality gains over sparsity-based compressed sensing methods. There are a number of large datasets available for fully supervised training, such as the fastMRI knee and brain data [25]. However, for many contrasts, orientations, or anatomical regions of interest, fully sampled datasets are not publicly available. Fully sampled data is rarely acquired as part of a normal scanning protocol, so acquiring sufficient training data for a specific application is highly resource intensive. In some cases, it may not even be technically feasible to acquire such data [26], [27], [28]. Therefore, for MRI reconstruction with deep learning to be applicable to datasets acquired using only standard protocols, a training method that uses solely sub-sampled data is required. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ There have been several attempts to train networks with only sub-sampled MRI data [29], [30], [31], [32], [33], [34], [35], [36], [37], some of which are based on methods from the denoising literature [38], [39], [40], [41], [42], [43], [44]. One such approach is Noise2Noise [38]. Rather than mapping y t to y 0,t , Noise2Noise trains a network to map y t to another sub-sampled k-space y T = M Ω T y 0,T where Ω T and Ω t are independent and y 0,T = y 0,t when t = T [31]. A limitation of Noise2Noise is that it requires paired data, so the dataset must contain two independently sampled scans of the same k-space [14], which is not part of standard protocols. Further, unless compensated for [45], any motion and phase drifts between scans would cause the paired data to be inconsistent, violating the central assumption that underlies the method.
SSDU [33] is a recently proposed method for ground-truth free training that does not require paired data. SSDU partitions the sampling set Ω t into two disjoint sets: At inference, the estimate fθ(y s ) is used. With a physics-guided network architecture, SSDU was found to have a reconstruction quality comparable with fully supervised training given certain empirically selected choices of A t and B t . However, it was presented without theoretical justification. Although SSDU has similarities with Noise2Self [40], Noise2Self's analysis has a strong requirement on independent noise, so do not apply to k-space sampling in general.

A. Contributions
This article considers the recently proposed Noisier2Noise framework [41], which was originally constructed for denoising problems. We modify Noisier2Noise so that it can be applied to variable density sub-sampled MRI data. To our knowledge, this is the first work that applies Noisier2Noise to image reconstruction. Like SSDU, the proposed modification of Noisier2Noise does not require paired data, and involves training a network to map from one subset of Ω t to another. While SSDU recovers one disjoint set from the other, Noisier2Noise applies a second sub-sampling mask to the data, y t = M Λ t y t = M Λ t M Ω t y 0,t , and the network is trained to recover y t from y t with an 2 loss. Then, at inference, the fully sampled data is estimated via a correction term based on the distributions of Λ t and Ω t that ensures that the estimate is correct in expectation.
Despite their superficial differences, we show that, in fact, SSDU and Noisier2Noise are closely related. Specifically, we demonstrate that SSDU is a version of Noisier2Noise with a particular loss function modification that removes the need for the correction term at inference. The primary contribution of this article is the use of Noisier2Noise to theoretically explain SSDU's excellent empirical performance. Specifically, we show that SSDU with an 2 loss correctly estimates fully sampled k-space in expectation: see Section II-D.
The second contribution of this article is the proposal of two modifications of SSDU that significantly improve its reconstruction quality and robustness to the parameters of M Λ t , both of which arise as a consequence of SSDU's connection to Noisier2Noise. Firstly, we use Noisier2Noise to inform SSDU's sampling set partition: we show that SSDU's performance improves when B t has the same type of distribution as the original mask Ω t , but not necessarily with the same parameters. Secondly, we show that SSDU's performance improves when a particular weighting is employed in the loss function. This non-trivial weighting, which arises as a consequence of the novel theoretical analysis of SSDU, depends on the distributions of Λ t and Ω t and has minimal additional computational cost: see Section II-F.
Although this paper focuses on MRI reconstruction, we emphasize that none of the theoretical developments are specific to k-space. This framework is therefore applicable to any image reconstruction problem with a forward model that involves random sub-sampling, such as low dose x-ray computed tomography [46] or astronomical imaging [47].

II. THEORY
This section describes how the Noisier2Noise framework can be applied to sub-sampled data. Additive and multiplicative noise versions of Noisier2Noise are proposed in [41]. Based on the observation that a k-space sub-sampling mask can be considered as multiplicative "noise", we extend Noisier2Noise to image reconstruction by modifying the latter. It is standard practice in MRI to sub-sample k-space with variable density, so that low frequencies, where the spectral density is larger, are sampled with higher probability [7]. Since the multiplicative noise version of standard Noisier2Noise assumes uniformity, this requires a modification of the framework to variable density sampling.

A. Variable Density Noisier2Noise for Reconstruction
The terms in the measurement model y t = M Ω t y 0,t can be considered as instances of random variables. We denote Y = M Ω Y 0 , where Y , M Ω and Y 0 are the random variables corresponding to y t , M Ω t , and y 0,t respectively. Now consider the multiplication of Y by a second mask represented by the random variable M Λ , so that Y is a further sub-sampled random variable. The following result states how the expectation of Y 0 can be computed from Y and Y . Here, and throughout this article, E[·] is used to denote the expectation over all random variables within the brackets.
Claim 1: where K is a diagonal matrix defined as Proof: See Section A of the Appendix, which is based on the proof given in Section III.D of [41].
Equation (3) generalizes the version of Noisier2Noise proposed for uniform, multiplicative noise in [41] to variable density sampling. The difference between the uniform and variable density versions is the matrix K, which is a scalar in [41]. For the special case where M Ω and M Λ are uniformly random sub-sampling masks, P , P and therefore K are proportional to the identity matrix, and (3) simplifies to the uniform version. The mathematical requirement that p j > 0 and p j < 1 for all j simply ensures that (1 − K) is invertible: see Section A of the Appendix.
Equation (3) implies that E[Y 0 | Y ] can be estimated without fully sampled data by training a network to estimate To do this, a network can be trained to minimize for a full-rank matrix W . The minimum occurs when the gradient with respect to θ is zero: where J is the Jacobian matrix with entries J ij = ∂f θ ( Y ) j /∂θ i . The number of parameters is typically much greater than N , so J has far more rows than columns. Assuming that the rows of J are maximally linearly independent, so the row space is N -dimensional, the only solution is If W is full-rank, W H W is also full rank, so left-multiplying by (W H W ) −1 and using Therefore, by (3), a candidate for estimating fully sampled kspace with sub-sampled data only is This expression does not use Y , so does not use all available data. Two candidate approaches for using all available data at inference are considered in this article. Firstly, one can overwrite known entries of the network output with Y : Here, the superscript refers to "data consistent", since the estimate is exactly consistent with Y . We emphasize thatŶ dc is consistent with all available data Y , not just the data in Y . Alternatively, similar to the approaches suggested in both SSDU [33] and the additive noise examples in Noisier2Noise [41], one can use singly sub-sampled k-space Y as the network input at inference: Since Claim 1 applies to (7) is not guaranteed to be correct in expectation. However, it has the advantage that all available data is used by the network. Hence, despite deviating from strict theory, we have found that it performs well in practice: see Section IV. This suggests the following procedure, illustrated in Fig. 1, for training a network without fully-sampled data. For each sub-sampled k-space in the training set y t = M Ω t y 0,t , generate a further sub-sampled k-space where M Λ t is an instance of M Λ . Then, approximate (5) by training a network to minimize the loss function for some full-rank matrix W . During inference, estimate fullysampled k-space with either where s indexes the test set.
In other words, we train a network to estimate the "singly" sub-sampled k-space y t from "doubly" sub-sampled k-space y t and then, during inference, apply a correction based on the diagonal matrix K to estimate the fully sampled data. The correction term only needs to be applied during inference and has minimal computational cost.
In [41], only the version with W = 1 was presented. Here we present a version with non-trivial W because it provides a theoretical link to SSDU; Section II-D shows that Nois-ier2Noise with the rank-deficient W = (1 − M Λ )M Ω is SSDU exactly.
Noisier2Noise and SSDU work because the network cannot deduce from y t which entries of y t are non-zero [41]. Therefore, the loss is minimized when the network learns to recover all of k-space: see Section V for a detailed discussion.

B. Choice of Mask Distributions
The only condition on the first mask M Ω from Claim 1 is that p j > 0 for all j. In other words, the guarantee only applies when there is a non-zero probability that there are sampled examples of all k-space locations in the training set.
Claim 1 also states that the second mask M Λ must obey p j < 1 for all j. This ensures that there is a non-zero probability that any entry of Y is masked. Unlike M Ω , whose distribution is determined by the acquisition protocol, the M Λ is chosen freely during training. Following [41], we suggest using a distribution of M Λ that is the same type as M Ω , but not necessarily with the same parameters. For instance, if M Ω is column-wise sampling with variable density, such as in Fig. 1, an appropriate M Λ is one that is also column-wise, but possibly with a different variable density distribution.

C. Choice of Network
Noisier2Noise is agnostic to the network architecture. We have found that using the data consistent function where g θ ( y t ) is a network with arbitrary architecture, may improve the performance of Noisier2Noise. This is because the g θ ( y t ) in (11) only recovers regions of k-space that are not already sampled in y t , so the network does not need to learn to map sampled k-space locations to themselves. We emphasize that (11) ensures that f θ ( y t ) is consistent with y t , while (9) ensures the estimateŷ dc s is consistent with y s , which is only applied at inference and cannot be part of the network architecture when y s is used as the input.
Many popular network architectures for MRI reconstruction are based on a sequence of "unrolled" iterations of a optimization algorithm [48] such as the Iterative Shrinkage Thresholding Algorithm (ISTA) [49] or the Alternating Direction Method of Multipliers (ADMM) [50]. These are variously known as "physics-guided", "physics-based" or "model-based" methods due to their explicit use of the MRI forward model. These architectures typically alternate between a module that recovers missing k-space entries by removing aliasing in the image domain and a module that ensures consistency with the k-space data. This implies that (11), or possibly a "soft" version of it where the data is not forced to be exactly consistent, may already be implemented as part of the network architecture. In the experimental evaluation of the methods in this article we used the Variational Network (VarNet) [12], [51], which is one such architecture where (11) is not necessary. However, in preliminary studies not presented in this article we found that a U-net [52], which does not already employ data consistency, benefited considerably from (11).

D. Relationship to SSDU
This section shows that SSDU [33] with an 2 loss is a version of Noisier2Noise with a particular rank-deficient loss weighting matrix W .
To see the connection between SSDU and Noisier2Noise, it is instructive to see the relationship between Noisier2Noise's Λ t and SSDU's disjoint subsets A t and B t . Disjoint subsets of Ω t can be formed in terms of Ω t and Λ t by setting A t = Ω t \Λ t and B t = Ω t ∩ Λ t . The distribution of A t and B t are defined by the distributions of Ω t and Λ t and always satisfy A t ∪ B t = Ω t and A t ∩ B t = ∅ as required. In terms of sampling masks, this is written as In other words, while Noiser2Noise's loss is computed over all k-space, SSDU's loss is computed only on indices that are in Ω t but not in Λ t .
SSDU's weighting ensures that any indices not sampled in Y are ignored in the loss. One might think that the correct choice for this goal would be W = M Ω t . However, if a data consistent network is employed, as in (11), the contribution to the loss from indices in both Ω t and Λ t would be zero because they are consistent by construction. Therefore the loss for W = M Ω t and W = (1 − M Λ t )M Ω t would be identical. A similar idea was presented for fully supervised learning in [53], where a mask is applied to the training data multiple times.

E. Proof of SSDU
This section shows that SSDU's loss weighting causes the correction (1 − K) −1 at inference to no longer be necessary. When the weighting matrix W is the random variable (1 − M Λ )M Ω , the network parameters are trained to seek a minimum of . The usual theoretical goal for selfsupervised methods is to prove that the network is correct in expectation [38], [39], [40], [41], [42], [43], [44], as in Claim 1 for variable density Noisier2Noise. In the following we state, to our knowledge, the first similar result for SSDU. Claim 2: A network with parameters that minimizes (12) satisfies Proof: See Section B of the Appendix. If 1 − K is invertible, which holds when p j > 0 and p j < 1 for all j, Therefore, in general, f θ * ( Y ) is correct in expectation, but only in regions of k-space that are not sampled in Y . This contrasts with the variable density Noisier2Noise method presented in Section II-A, which is correct in expectation for all k-space indices. However, as described in the following, this apparent shortcoming can easily be circumvented by using all available data at inference.
Similarly to Noisier2Noise's (9) and (10), we consider two options for the k-space estimate at inference, both of which use all available data. Firstly, similarly to (9), the data consistent estimateŶ can be used, which is correct in expectation everywhere in k-space for any network architecture. Alternatively, the SSDU paper [33] suggests usingŶ and a physics-guided network architecture. Like (10) for Nois-ier2Noise, the network input for (15) is singly sub-sampled, so Claim 2 does not apply and the estimate is not guaranteed to be correct in expectation. Nonetheless, it has the advantage over (14) that it uses all available data in the input to the network. As in [33], we have found that (15) performs well in practice when the network architecture includes a data consistency module: see Section IV. We emphasize that unlike Noisier2Noise, SSDU does not require the correction term (1 − K) −1 at inference. This implies that SSDU is less sensitive to inaccuracies in f θ * ( Y ), and we have found that SSDU outperforms Noisier2Noise in general: see Section IV.

F. K-Weighted SSDU
Since we train on a finite number of instances of the random variables Y , Y , Ω and Λ, the network parameters we obtain in practice, which we denoteθ, are an approximation of the ideal θ * from (12). In this case, the right-hand-side of (13) is not exactly zero. Rather, where E is a vector random variable. The vector E characterizes the difference between a true expectation and the network's estimate of it, which is non-zero for finite data. In other words, E is a statistical error due to finite sampling. The difference between the trained network's output and the expectation of To compensate for this, we propose minimizing the following weighted version of SSDU's loss as an alternative to (12): Introducing (1 − K) − 1 2 in the loss cancels the 1 − K in (16), so mitigates the error amplification caused by θ * approximation. We find that this version of SSDU, which we refer to as "K-weighted SSDU" throughout the remainder of this article, substantially improves the image restoration quality and robustness to training hyperparameters: see Section IV. We chose the power (1 − K) − 1 2 because it exactly cancels the 1 − K on the left-hand-side of (16) when the squared 2 loss is used; we also tried power (1 − K) −1 and found that, as expected, it did not perform as well in practice.

G. Understanding the Need for Correction
This section intuitively explains why Noisier2Noise requires correction at inference but SSDU does not. We can write the weighted loss as where we have used that the term is square brackets equals the identity matrix.
where we have used In (17) is SSDU's loss function (12) plus a contribution from all j ∈ Ω c t . Intuitively, the second term on the right-hand-side of (17) causes the proposed method to underestimate regions of k-space with index j ∈ Ω c j . This underestimation is compensated for the second term on the right-hand-side of (17) is zero, k-space is not underestimated anywhere, and there is no need for a correction term at inference.

A. Description of Data
We used the multi-coil brain and knee data from the fastMRI dataset [25], which is comprised of multi-channel raw k-space MRI data. The reference fastMRI test set data is magnitude images only, without fully sampled k-space data. Since we also require phase, we discarded the data allocated for testing and generated our own partition into training, validation and test sets. For the brain data, we only used data that was acquired on 16 coils, and used training, validation and test set sizes of 127, 19 and 14 volumes (2020, 302, and 224 slices) respectively. For the knee data, the training, validation and test sets consisted of 166, 19 and 14 volumes (5977, 665, and 493 slices) respectively. We set the network output to be zero in regions of k-space where the reference data had zero padding.

B. Network Architecture
For f θ , we used the variant of the VarNet [12] that estimates coil sensitivities on-the-fly [51], which performs competitively on the fastMRI leaderboard and is available as part of the fastMRI package. 1 After a coil sensitivity estimation module, VarNet uses multiple repetitions of a module based on gradient descent, which is comprised of a data consistency term in k-space and a prior based on a U-net [52] that acts in the image domain after an inverse Fourier transform and coil combination. The output of the neural network was in k-space. We used 6 repetitions of the main module, so that our model had around 1.5 × 10 7 parameters. Note that in [25], the Structural Similarity Index (SSIM) [54] was used as the loss, while in this article we use an 2 loss.
The only additional operations SSDU and Noisier2Noise require compared to fully-supervised training are simple entrywise masks, so all methods had similar memory requirements and training time. We trained for 50 epochs, which took around 17 hours on a GTX 1080 Ti GPU with 11 GB of RAM for the brain data. For all methods we used the Adam optimizer [55] with a fixed learning rate of 10 −3 . Our PyTorch implementation is publicly available on GitHub. 2

C. Distribution of Masks
So that the distribution of the sampling masks were known exactly, we generated our own masks rather than using those suggested in fastMRI. Unless stated otherwise, the distribution of the first mask M Ω was 1D column-wise. We fully sampled the central 10 columns and sampled the remainder with polynomial variable density. We used polynomial order 8, and scaled the probability density P so that it matched a desired acceleration factor. We ran each method with R Ω ∈ {4, 8}, where R Ω = N/ j p j is the expected acceleration factor. An example at R Ω = 4 is shown in Fig. 2(a).
In [41], it is suggested that the distribution of Noisier2Noise's second random variable is the same as the first, but not necessarily with the same distribution parameters. Therefore, for Noisier2Noise's second mask M Λ , we used the same type of distribution as M Ω with a different variable density. An example with R Ω = 4 and R Λ = N/ j p j = 2 is shown in Fig. 2(b). Concretely, we define two masks as having the same 'type' of distribution when the conditional dependence of the sampling set indices is the same. Let p j|k = P[j ∈ Ω|k ∈ Ω]. If p j|k = p j for all j and k, the entries are independent and the mask is the type '2D Bernoulli'. If p j|k = 1 when j and k are in the same k-space column and p j|k = p j otherwise, the mask is the type '1D column-wise'. The experiments in this article focus on these two types of masks; other types are discussed in Section V. We emphasize that constraining a mask to a type does not constrain the p j s, which define the variable sampling density.
To ensure that p j < 1 everywhere, we set p j = 1 − in the central 10 columns of k-space, where is a small real constant. The network architecture ensures that the central region is consistent with the input, so can be small without penalty. We used = 10 −3 . In order to be a realistic simulation of prospectively subsampled data, the sampling set Ω t must be fixed for all epochs. However, Λ t need not be. Therefore, we re-generated M Λ t from the distribution of M Λ once per epoch. Since the network sees more samples from the distribution of M Λ , the loss function is closer to (5), so fθ is expected to be a more accurate ap- This has similarities with training data augmentation, as each slice is used to generate several inputs to the network [56].

D. Comparative Methods
We trained Noisier2Noise using different weightings of the 2 loss stated in (8). For each self-supervised method, we considered two possible estimates at inference: one with the doubly sub-sampled y s as the network input and the other with the singly sub-sampled y s . The methods and their two estimates at inference are summarized in Table I. We trained with W = 1, referred to as "Unweighted Nois-ier2Noise". By Claim 1, Unweighted Noisier2Noise requires a (1 − K) −1 correction at inference: see Table I. We have found that the need for correction substantially reduces the image quality compared to SSDU, so do not recommend using Unweighted Noisier2Noise in practice. Nonetheless, we include some Unweighted Noisier2Noise results to illustrate the value of SSDU's loss weighting.
We also trained Noisier2Noise with W = (1 − M Λ )M Ω which, based on the relationship described in Section II-D, we refer to as "SSDU", despite some differences between our implementation and [33]. In [33], a mixture of an 1 and 2 loss was used, whereas here, so that it can be directly compared with Unweighted Noisier2Noise, we used an 2 loss. We also used a different M Ω distribution, dataset and network architecture to [33].
SSDU [33] was originally applied to an architecture that requires pre-computed sensitivity maps. It was suggested that M B t has a fully sampled 4 × 4 central region and 2D Gaussian variable density otherwise, so that high frequencies are sampled with higher probability. For the architecture considered in this article, which has a coil sensitivity estimation module, we found that increasing the size of the fully sampled central region considerably improved the method's performance. Since M Ω has 10 fully sampled central columns, we increased the size of the central region of M Λ to 10 × 10.
As the probability of sampling each location in k-space is independent, the sampling set partition proposed in [33] is To estimate their variable density distribution P we ran the SSDU authors' set partitioning code 3 1000 times on a fully sampled mask and averaged the result. We trained SSDU using a distribution of M Λ of this type, referred to as "2D partitioned SSDU", illustrated in Fig. 2(c). We also trained SSDU using the same distribution type of M Λ as M Ω , as in Fig. 2(b). We refer to this method as "1D partitioned SSDU", or "K-weighted 1D partitioned SSDU" when a (1 − K) − 1 2 weighting is used in the loss as described in Section II-F. Like Unweighted Noisier2Noise, M Λ t was re-generated once per epoch [56]. We emphasize that although 2D partitioned SSDU has a similar M Λ distribution as in [33], the distribution of M Ω here is random variable density columns, not equidistant columns as in [33]. Therefore, 2D partitioned SSDU is not necessarily expected to perform as well as SSDU in [33].
As a best-case target, we also trained using a fully supervised method with an (unweighted) 2 loss. All deep learning methods had the same network architecture and training hyperparameters, as described in III-B.
Finally, as a comparative method that does not use deep learning, we ran a compressed sensing algorithm with a sparse model on wavelet coefficients, which we implemented via the Berkeley Advanced Reconstruction Toolbox (BART) [57]. We used BART's default settings with fourth-order Daubechies wavelets and a sparse weighting of λ = 2 × 10 −3 .

E. Quality Metrics
To evaluate the reconstruction quality, we computed the Normalized Mean Squared Error (NMSE) in k-space on the test set: ŷ s − y 0,s 2 2 / y 0,s 2 2 . We also computed the imagedomain root-sum-of-squares (RSS),x s = ( c |F H y s,c | 2 ) 1/2 where y s,c is the k-space entries on coil c and F is the discrete Fourier transform, cropped the RSS estimate to a central 320 × 320 region and computed the SSIM, as suggested in fastMRI [25].

IV. RESULTS
For brevity, the results presented here focus on R Ω = 8. Similar results for the brain data at R Ω = 4 are shown in the supplementary material: see Figs. S1-S4.
For the brain data, we evaluated the dependence of the methods' performance on the distribution of M Λ by varying the parameters so that the sub-sampling factor R Λ changed. We trained 3 [Online]. Available: https://github.com/byaman14/SSDU Fig. 3. Mean test set NMSE percentage difference between fully supervised and each methods at R Ω = 8 and a 1D distributed M Ω , where R Λ has been tuned to minimize the test set NMSE. Fig. S1 shows a similar plot for R Ω = 4.  Table S1. with R Λ ∈ {1.2, 1.6, 2, 4, 6}, except for 2D partitioned SSDU, which we found needed finer tuning and a smaller R Λ for the best performance, so we trained with R Λ ∈ {1.1, 1.2, . . . , 2, 3, 4, 6}.

A. Performance With Tuned R Λ
This section focuses on the case where R Λ has been tuned to minimize the ground truth test set NMSE. Figs. 3 and S1 show bar charts of the percentage difference between fully supervised training and each method: (μ − μ full )/μ full where μ and μ full are the mean NMSE of interest and mean NMSE of fully supervised training respectively. The best performance was for K-weighted 1D partitioned SSDU with a y s input; its mean NMSE was only 1.1% and 0.8% larger than fully supervised for R Ω = 8, 4 respectively. Figs. 4 and S2 show box plots of the NMSE of each method for R Ω = 8 and R Ω = 4 respectively: see Table S1 of the supplementary material for the numerical values.
To evaluate whether the proposed changes to SSDU were statistically significant, we performed a one-sided Wilcoxon signed-rank test with p-value 0.01 on the test set NMSEs. For both the y s and y s inputs, we found that there was a significant statistical difference between 2D and 1D partitioned SSDU. We also found that the difference between 1D partitioned SSDU and K-weighted 1D partitioned SSDU was statistically significant.
Figs. 5 and S3 show RSS estimates from the test set at R Ω = 8 and R Ω = 4 respectively. Qualitatively, K-weighted 1D partitioned SSDU performs the most similarly to fully supervised training. Although 2D partitioned SSDU has a competitive quantitative score for the estimate with y s input, it exhibits some streaking artifacts.
Unweighted Noisier2Noise's performance was substantially worse than SSDU. Therefore we compare SSDU and its modifications only in the remainder of this article. Fig. 6. Dependence of the test set NMSE on the acceleration factor of the second mask M Λ , denoted as R Λ , at R Ω = 8 for both outputs. 1D partitioned SSDU is far more robust to the tuning of R Λ than 2D partitioned SSDU. Fully supervised learning does not use a second mask M Λ , so has the same performance for all R Λ . A similar figure for R Ω = 4 is in the supplementary material, Fig. S3. Fig. 7. Robustness to R Λ , where the blue box highlights the case where R Λ is tuned. K-weighted 1D partitioned SSDU is very robust to R Λ , with very similar restoration quality for all R Λ between 1.6 and 6. 2D partitioned SSDU is far more sensitive, with substantial degradation in image quality for mistunings as small as 0.1. Here, we show the estimate with y s input only.

B. Robustness to R Λ
For actual, prospectively sampled data, it would not be possible to tune R Λ on the ground truth test set NMSE. The practicality of SSDU therefore depends greatly on the robustness to R Λ . Figs. 6 and S4 show the dependence of the mean test set NMSE on R Λ for R Ω = 8 and R Ω = 4 respectively. K-weighted 1D partitioned SSDU was the most robust to the tuning of R Λ . 2D partitioned SSDU was the least robust, especially for the estimate with y s input. This is visualized in Fig. 7, which shows reconstruction examples for a number of R Λ s. K-weighted 1D partitioned SSDU performs very similarly for all R Λ s between 1.6 and 6, while 2D partitioned SSDU's restoration quality significantly degrades qualitatively and quantitatively for mistunings as small as 0.1.

C. Performance on 2D Sampled Brain Data
To further evaluate the role of the partitioning distribution, we also ran 1D and 2D partitioned SSDU on the brain data with a 2D Bernoulli sampled M Ω . In this case, the type matching of the second mask to M Ω is switched: 2D partitioned SSDU's second mask has the same type of distribution as the first, while 1D partitioned SSDU has a different type. For M Ω , we used a fully sampled 10 × 10 central region and a polynomial variable density that samples low frequencies with higher probability otherwise. We used R Λ = 1.2 and R Λ = 4 for 2D and 1D partitioned SSDU respectively. All other hyperparameters and network specifics were unchanged.
In this case, the best performance was 2D partitioned SSDU, which performed very similarly to fully supervised training: see Fig. 8. The y s input had a mean test set NMSE of 0.141 and 0.144 for 2D and 1D partitioned SSDU respectively, and the y s input had 0.141 and 0.145, compared with 0.139 for fully supervised training. Although not shown in Fig. 8 for brevity, we also trained 2D partitioned SSDU with a (1 − K) − 1 2 loss weighting. As for 1D partitioned SSDU in Section IV-A, we found that this reduced the mean NMSE further to 0.140 for both the y s and y s input.

D. Performance on 1D Sampled Knee Data
We also trained K-weighted 1D partitioned SSDU on the fastMRI knee data with the same network architecture, training hyperparameters, and a 1D distributed M Ω . The sub-sampling factor of the first and second masks were R Ω = 8 and R Λ = 2 respectively. The mean test set NMSE was 0.233 and 0.231 for the estimates with y s and y s inputs respectively, compared with 0.230 for fully supervised training. Fig. 9 shows two example reconstructions from the test set, demonstrating competitive performance with fully supervised training qualitatively.

V. DISCUSSION
Due to its need for correction at inference, Unweighted Nois-ier2Noise had consistently the worst score. We therefore do not recommend using Unweighted Noisier2Noise in practice. Rather, we suggest using a variant of SSDU, which has a loss weighting that removes the need for such a correction.
The hierarchy of 1D and 2D partitioned SSDU depends on the distribution of M Ω . In particular, the best performance was when they are both 1D or both 2D. It is conventional wisdom that better reconstruction quality is possible when k-space is randomly sub-sampled in both spatial dimensions (see, for instance, [58]). This is because the image-domain aliasing is incoherent in both dimensions, so is easier to remove. The superior performance of 1D partitioned SSDU compared with 2D partitioned SSDU when M Ω is 1D shows that it is not necessarily true that the sampling set partition should also ideally be two-dimensional. Rather, better performance is possible when the distribution of M Ω and M Λ M Ω are of the same type.
To see why, consider the nature of the aliasing caused by sub-sampling and further sub-sampling k-space, focusing on the example of a random 1D column sampled M Ω . Such sampling causes the image-domain aliasing to be horizontally incoherent and vertically coherent. With a 1D column-wise Λ t , further horizontal aliasing is introduced. Since the network cannot distinguish between the horizontal aliasing caused by Ω t or Λ t , the loss is minimized when the aliasing due to both is removed. On the other hand, a 2D Λ t introduces some aliasing that is orthogonal to the original aliasing, which is distinguishable in principle. In this case, the loss is minimized when the network removes the aliasing caused by Λ s , but not necessarily the original aliasing caused by Ω s . This is visible in Figs. 5 and 8, where SSDU fails to completely remove artifacts caused by M Ω when M Λ does not have the same type of distribution.
This implies that, in general, better performance is possible when the distribution of the aliasing of y t and y t are of the same type. For both the independent 1D column sampling and 2D Bernoulli sampling considered here, this can be achieved by choosing a M Λ with the same type of distribution as M Ω . Recently, in [59], this was also observed empirically for SSDU with random spoke sampling. However, such a procedure does not always achieve this goal. For instance, while the SSDU paper [33] considers a fully sampled central region and equidistant column sampling, recovery of images with regular undersampling is not currently considered in the proposed framework. In this case, a Λ t of the same type would not give a y t with the same aliasing type as y t . The 2D Gaussian variable density partition employed in this article was originally constructed to handle such sampling patterns, and was found to perform very well in this context. Future work includes establishing the correct sampling set partitions for M Ω distributions not in [33] or covered by the approach suggested here.
We found that K-weighted SSDU further improved the image quality and robustness to R Λ . Consider the jth entry of the (squared) weighting (1 − K) −1 in terms of sampling probabilities: .
This leads to the following intuitive interpretation of the proposed loss weighting as compensation for the variable density of Ω and Λ. A smaller denominator P(j ∈ Ω \ Λ) implies that the jth location occurs less frequently in the loss, which is compensated for by an increased weighting. A smaller numerator P(j / ∈ Λ ∩ Ω) implies that the jth location is estimated by the network less frequently, so has a decreased weighting.
The benefit of the (1 − K) −1 weighting highlights and addresses a general challenge of self-supervised learning with variable density sampling: regions of k-space sampled with lower probability are underrepresented in the loss. This issue has been noted in other works. For instance, for variable density reconstruction with Noise2Noise, [60] suggests weighting the loss function by the sampling density. An alternative approach was suggested in [61], which suggests passing the training target through the network before it is employed in the loss function. We note that if the sampling and partitioning had uniform density, such as in [56], K would also be uniform, so the proposed weighting would not be required. This may explain in part the empirical performance observed in [56].
When M Ω was 1D, with the exception of 2D partitioned SSDU, Fig. 6 shows that the estimate with y s input performed similarly or better than with y s input when R Λ is tuned. This indicates that, for these methods, the advantage of using all the data in the input to the network outweighs the disadvantage that the input data has a different sampling distribution to the training data so is not guaranteed by Claim 1 or 2 to be correct in expectation. Heuristically, when M Ω and M Λ M Ω are both variable density column-wise sampled, a network trained on doubly sub-sampled data is likely to also be able to handle singly sub-sampled data. However, for 2D partitioned SSDU, M Λ M Ω is no longer column-wise, see Fig. 2(c). Accordingly, 2D partitioned SSDU was the only method that had a higher NMSE for the y s input compared to the y s input.
The best R Λ for 2D partitioned SSDU was lower than competing methods: R Λ = 1.8 and R Λ = 1.2 for the y s and y s inputs respectively. In [33], the sampling set partition was quantified in terms of the ratio ρ = |A t |/|B t |, and it was found that ρ = 0.4 offered the best performance. Since the M Ω distributions are different here, the optimal ρ is not expected to necessarily be the same. For 2D partitioned SSDU R Λ = 1.8 and R Λ = 1.2 corresponds to ρ = 0.52 and ρ = 0.21 respectively, while for the other methods's best performance at R Λ = 4 corresponded to ρ = 0.57. Therefore the ρ were reasonably similar despite the substantial difference in R Λ .
Since the network architecture uses y t in its coil sensitivity estimation module, not y t , it is plausible that the differences between 1D and 2D partitioning could be due to poorer coil sensitivity estimation rather than an intrinsic property of the partition change. To examine this, we re-trained tuned 1D and 2D partitioned SSDU on the 1D sampled brain data with kspace masked to a central 10 × 10 region in the coil sensitivity estimation module. We found that the test set NMSE was within 1% of the usual approach. This verifies that the performance improvement was indeed a consequence of the partition change, not simply a consequence of specifics of the architecture.
Unweighted Noisier2Noise's correction at inference (1 − K) −1 is only valid when an 2 loss is used; we have found that other loss functions do not perform well in practice. This loss leads to smoothing artifacts, even for fully supervised training. For SSDU, since there is no correction term, loss functions other than 2 are possible. For instance, in [33], a mixture of 2 and 1 was used. Better visual quality may be achievable when SSDU is implemented with a different loss; we do not suggest using an 2 loss in general, it is only required here so that it can be compared directly with Noisier2Noise.
For all self-supervised methods in this work, we re-generated Λ t once per epoch. This has similarities to the multi-mask SSDU approach proposed in [56]. However, in [56], a fixed number n Λ of Λ t s were generated for each Ω t , each of which were treated as an additional member of the training set. Therefore, unlike in this article, each epoch was n Λ times as long. Future work includes establishing whether it is also advantageous to limit the number of unique Λ t s per Ω t for the approach considered in this article.
All methods in this article were trained without taking measurement noise into account [62], [63]. Recent work by the present authors has shown that the additive and multiplicative versions of Noisier2Noise can be combined to recover higher fidelity images than SSDU in the presence of noise [64].

VI. CONCLUSIONS AND FUTURE WORK
Based on the observation that SSDU is a version of Nois-ier2Noise with a particular rank-deficient loss weighting, we proved that SSDU correctly estimates Y 0 in expectation. This analysis led to two proposals that we found significantly improved SSDU's performance in practice. Firstly, we propose employing a distribution of M Λ M Ω that is the same type as the original mask M Ω . Secondly, we propose introducing a weighting of (1 − K) − 1 2 in SSDU's loss. We found that that each of these modifications significantly improved SSDU's test set NMSE and robustness to R Λ .
There are a number of other self-supervised learning methods that also use sampling set partitioning [37], [56], [65], some of which are variants of SSDU. For instance, [37], [65], [66] propose training two networks in parallel, one for each sampling subset, with a loss function that includes the difference between the outputs of the two networks. Another recent development is zero-shot SSDU [67], which shows that sampling set partitioning can also be applied to recover images without a training dataset [68]. Future work includes determining whether the theoretical and practical developments of this article can be extended to these methods.

A. Proof of Variable Density Noisier2Noise
This section of the Appendix proves that when p j = 0 and p j = 1 for all j, Proof: This proof is based on Section III-D of Nois-ier2Noise [41], but with more mathematical detail and generalized to variable density sampling. Following the compressed sensing literature, this article uses p j to refer to the probability that the jth location in k-space is sampled. This differs to [41], which uses p to denote the probability that a pixel is zeroed. We To do this, we split E[Y j | Y j ] into two cases, for conditions Y j = 0 or Y j = 0, and subsequently construct an expression that is consistent with both.
where we define k j = P[Y j = 0| Y j = 0]. Evaluating each of the terms on the right-hand-side of (20) in turn: conditionally zero, its expectation is also zero: The measurement model implies that when Y j is non-zero, and therefore unmasked, it takes the value of Y 0,j . Therefore its expectation can be written in terms of the expectation of Y 0,j : r k j : By the definition of conditional expectation: The numerator is is the probability that j ∈ Ω. By the partition theorem, the denominator is Therefore Substituting the above results into (20) gives where k j is defined in (22).

Combining Cases 1 and 2. (E[Y j | Y j ]):
To find E[Y j | Y j ], one must construct an expression that holds for both (19) and (23). Consider the following candidate: This expression can be verified as consistent with (19) by setting Y j = 0: as required. Secondly, setting Y j = 0 gives as required by (23). Therefore (24) is consistent with both (19) and (23), so is a correct expression for E[Y j | Y j ]. When 1 − k j = 0 we can rearrange (24) By the expression for k j given in (22), 1 − k j is so is non-zero when p j = 0 and p j = 1. Writing (25) in terms of vectors and matrices yields (18), as required.

B. Proof of SSDU
This section of the Appendix proves that a network trained with SSDU's loss weighting (1 − M Λ )M Ω satisfies Proof: By (6), the minimum of SSDU's loss function (12) gives a function that satisfies Similarly to Section A of the Appendix, the following derives under two conditions, Y j = 0 and Y j = 0, and subsequently find an expression that is true for both. In the following, m j and m j are the jth diagonals of M Λ and M Ω respectively.
: When Y j = 0, the jth entry is not masked: m j = 1. Therefore (1 − m j )m j = 0 and the expression is zero: : When Y j = 0, m j m j = 0, so (1 − m j )m j = m j . Therefore As for Case 2 of Section A of the Appendix, we can use the partition theorem to express (29) as a weighted sum: where k j = P[Y j = 0| Y j = 0] as in Section A of the Appendix, given in (22). Taking each term in turn: when it is zeroed by the mask, m j = 0. Therefore is not zeroed by the mask, so m j = 1: Further, since Y j = Y 0,j when Y j = 0 by the measurement model, Substituting the above results in to (30) gives Combining Cases 1 and 2: A correct expression for E[(1 − m j )m j (f θ * ( Y ) j − Y j )| Y j = 0] must be true for both Case 1 and 2, so consistent with both (28) and (31). Consider the candidate Equation (32) is consistent with (28) because (1 − m j m j ) = 0 when Y j = 0, and consistent with (31) because (1 − m j m j ) = 1 when Y j = 0. Using the vector form of (32) and setting as required.
ACKNOWLEDGMENT For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.